Implement literal `np.timedelta64` coding #10101

spencerkclark · 2025-03-06T13:39:37Z

This PR implements @shoyer's suggested approach for "literal" coding of np.timedelta64 values. Accordingly, it provides a pathway for roundtripping np.timedelta64 data without a FutureWarning, and preserves the encoding of variables that were encoded on disk with the previous approach.

I still want to reflect a little more on whether we want any more tests, but this seems functional at the moment—i.e. this example runs without a warning¹:

>>> import xarray; import numpy as np
>>> deltas = np.array([1, 2, 3], dtype='timedelta64[D]').astype('timedelta64[s]')
>>> ds = xarray.Dataset({'lead_time': deltas})
>>> xarray.open_dataset(ds.to_netcdf())
<xarray.Dataset> Size: 24B
Dimensions:    (lead_time: 3)
Coordinates:
  * lead_time  (lead_time) timedelta64[s] 24B 1 days 2 days 3 days
Data variables:
    *empty*

@kmuehlbauer let me know if you have any initial thoughts, particularly with respect to possible interaction with other coders.

- Closes #10099 (Timedelta64 data cannot be round-tripped to netCDF files without a warning)
- Tests added
- User visible changes (including notable bug fixes) are documented in whats-new.rst

¹ Note that I needed to move away from nanosecond resolution here, since creating bytes with to_netcdf will attempt to cast int64 data to int32, which leads to overflow.
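To illustrate the footnote, here is a minimal sketch of the arithmetic, using only NumPy. It does not call to_netcdf; it only checks whether the integer representation of the example values would fit in int32 at each resolution. The variable names are illustrative.

```python
import numpy as np

deltas = np.array([1, 2, 3], dtype="timedelta64[D]")

# Integer representation of the same durations at two resolutions.
as_ns = deltas.astype("timedelta64[ns]").astype(np.int64)
as_s = deltas.astype("timedelta64[s]").astype(np.int64)

int32_max = np.iinfo(np.int32).max  # 2147483647

# One day is 86_400_000_000_000 ns, far beyond the int32 range,
# while three days is only 259_200 s.
fits_ns = bool((np.abs(as_ns) <= int32_max).all())
fits_s = bool((np.abs(as_s) <= int32_max).all())
print(fits_ns, fits_s)
```

At nanosecond resolution even a single day exceeds the int32 range, which is consistent with the overflow described in the footnote.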

kmuehlbauer · 2025-03-06T14:38:40Z

@spencerkclark Thanks! I'll try to look into this tomorrow. The major interaction issues are with CFMaskCoder/CFScaleOffsetCoder and CFDatetimeCoder. So I do not expect too many issues with other coders here, but I'll check anyway.

xarray/coding/variables.py

shoyer · 2025-03-07T01:18:44Z

Thanks @spencerkclark for taking a look at this! I left a couple of suggestions.

Incidentally, I think everyone will be happier the sooner we can relax the nanosecond precision restriction in Xarray :).

kmuehlbauer · 2025-03-07T12:48:04Z

> @kmuehlbauer let me know if you have any initial thoughts, particularly with respect to possible interaction with other coders.

@spencerkclark I do not immediately see any issues with other coders with the current implementation. Taking @shoyer's suggestion into account, we would need to make sure that xarray always gives preference to the dtype attribute over units (if both are given).

shoyer · 2025-03-08T22:41:45Z

Looking at the implementation, maybe we don't need the separate new coder class and could keep this all in one object?

I think the behavior could probably fit into a single coder:

Encoding time units (to disk):
- Convert data timedelta64 -> integer
- Write both units and dtype attributes
Decoding time units (from disk):
- Convert data: integer -> timedelta64
- If dtype != 'timedelta64', issue a future warning.
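The steps above can be sketched with plain NumPy and a dictionary of attributes. This is only an illustration of the proposed behavior, not the code in CFTimedeltaCoder: the function names `encode_timedelta` and `decode_timedelta`, the unit-name table, and the warning text are all assumptions made for this example.

```python
import warnings

import numpy as np

# Hypothetical mapping between NumPy unit codes and CF-style unit names.
UNIT_NAMES = {"s": "seconds", "ms": "milliseconds", "us": "microseconds", "ns": "nanoseconds"}
NAME_TO_UNIT = {name: unit for unit, name in UNIT_NAMES.items()}


def encode_timedelta(data):
    """Convert timedelta64 -> int64 and record both units and dtype."""
    unit, _ = np.datetime_data(data.dtype)
    attrs = {"units": UNIT_NAMES[unit], "dtype": str(data.dtype)}
    return data.astype(np.int64), attrs


def decode_timedelta(values, attrs):
    """Convert int64 -> timedelta64, preferring the dtype attribute over units."""
    dtype_attr = attrs.get("dtype", "")
    if dtype_attr.startswith("timedelta64"):
        return values.astype(dtype_attr)
    # Older files carry only a units attribute; warn that this path may change.
    warnings.warn(
        "decoding timedeltas from the units attribute alone (illustrative message)",
        FutureWarning,
        stacklevel=2,
    )
    unit = NAME_TO_UNIT[attrs["units"]]
    return values.astype(f"timedelta64[{unit}]")


data = np.array([1, 2, 3], dtype="timedelta64[s]")
encoded, attrs = encode_timedelta(data)
decoded = decode_timedelta(encoded, attrs)
```

Giving the dtype attribute precedence when both attributes are present matches the point @kmuehlbauer raised earlier in the thread.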

spencerkclark · 2025-03-09T00:24:35Z

Thanks @shoyer—indeed it's probably simpler to have this all live in CFTimedeltaCoder, and we can re-use some existing code there. I went ahead and made that change.

The remaining awkward bit relates to interaction with masking and scaling. Since we are overloading the dtype encoding, we do not have a way of retaining the numerical dtype of the data that was written to disk during a round trip. I pushed a failing test in 503db4a to provide an example. Maybe I am just not thinking about this the proper way, however.

spencerkclark · 2025-03-22T19:28:27Z

> The remaining awkward bit relates to interaction with masking and scaling. Since we are overloading the dtype encoding, we do not have a way of retaining the numerical dtype of the data that was written to disk during a round trip. I pushed a failing test in 503db4a to provide an example. Maybe I am just not thinking about this the proper way, however.

For now I ended up punting on this by forbidding any other encoding parameters when encoding timedeltas this new way. The previous encoding path supports these parameters, and that functionality is still maintained in this PR.

Another approach would be to use a different name for the attribute we assign the timedelta dtype to, say "xarray_dtype" instead of "dtype", but this strays from how things are handled for boolean values, and I am not sure we are losing much by not supporting these other encoding parameters ("_FillValue", "missing_value", "add_offset", and "scale_factor") in this code path. I'm open to discussion, however.

So possibly this is closer now—I added some more tests and laid the groundwork for eventually turning off automatic decoding of variables with time-like units attributes (we would set CFTimedeltaCoder.decode_via_units to False by default instead of True). Ultimately though this is a delicate PR that should be carefully discussed / reviewed.

jorenham · 2025-04-06T03:58:21Z

Just out of curiosity: How will a np.timedelta64("NaT") or np.timedelta64(None), whose .item() is None, be encoded? Because there's no int("nan") or something 🤔.

Also, will it ignore the units of NaT such as np.timedelta64("nat", "as")? I'm not sure what the use-case is, but for some reason NaTs can have a unit in NumPy.

spencerkclark · 2025-04-06T14:55:18Z

Thanks for taking an interest in this PR @jorenham—under the hood np.timedelta64("NaT") is represented by np.iinfo(np.int64).min, so it is straightforward to roundtrip with this type of encoding:

>>> arr = np.array([1, "NaT"], dtype="timedelta64[s]")
>>> arr.astype(np.int64)
array([                   1, -9223372036854775808])
>>> arr.astype(np.int64).astype("timedelta64[s]")
array([    1, 'NaT'], dtype='timedelta64[s]')

Though maybe for CF convention purposes we should record np.iinfo(np.int64).min as a "_FillValue" by default, and perhaps still allow specifying other fill values. While not necessary for xarray, it could be useful for downstream interpretation. I went ahead and implemented this in 46169ab. "_FillValue" or "missing_value" are now allowed in this encoding pathway—it is expected these will have the same dtype as the encoded values (int64)—while "add_offset" and "scale_factor" still are not.
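A minimal sketch of how such a fill value could interact with NaT, assuming the rules described above: int64 fill values are allowed, "add_offset" and "scale_factor" are rejected. The helper names `encode_with_fill` and `decode_with_fill` and the error message are hypothetical and are not taken from the PR.

```python
import numpy as np

NAT_AS_INT = np.iinfo(np.int64).min  # integer representation of NaT
FORBIDDEN = ("add_offset", "scale_factor")


def encode_with_fill(data, encoding=None):
    """Encode timedelta64 as int64, writing NaT as the chosen fill value."""
    encoding = dict(encoding or {})
    bad = [key for key in FORBIDDEN if key in encoding]
    if bad:
        raise ValueError(f"unsupported encoding parameters: {bad}")
    key = "missing_value" if "missing_value" in encoding else "_FillValue"
    fill = np.int64(encoding.get(key, NAT_AS_INT))
    ints = np.where(np.isnat(data), fill, data.astype(np.int64))
    return ints, {key: fill, "dtype": str(data.dtype)}


def decode_with_fill(ints, attrs):
    """Decode int64 back to timedelta64, restoring NaT at the fill value."""
    fill = attrs.get("_FillValue", attrs.get("missing_value"))
    out = ints.astype(attrs["dtype"])
    out[ints == fill] = np.timedelta64("NaT")
    return out


arr = np.array([1, NAT_AS_INT], dtype=np.int64).astype("timedelta64[s]")
ints, attrs = encode_with_fill(arr, {"_FillValue": -1})
restored = decode_with_fill(ints, attrs)
```

With no fill value supplied, the sketch falls back to np.iinfo(np.int64).min, so NaT passes through unchanged.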

The units will not be ignored, since the goal of this PR is to faithfully roundtrip np.timedelta64 values including their precise dtype.

kmuehlbauer · 2025-05-07T07:22:48Z

xarray/coding/times.py

+                dtype = np.dtype(dtype)
+                resolution, _ = np.datetime_data(dtype)
+                if resolution not in typing.get_args(PDDatetimeUnitOptions):
+                    raise ValueError(


I tend to agree with raising here. Still, it would be good to offer users a way to decode in those cases. Could we reduce this to a warning and use the nearest fitting decoding ("s" or "ns")?

spencerkclark · 2025-05-11

Thanks @kmuehlbauer, I kind of went back and forth on this. While it might technically be possible, I sort of lean toward being strict until someone produces a compelling complaint, since enabling it would make round tripping more complicated. I think the only way this situation could come up would be if someone constructed a dataset with a variable with a timedelta64[non-pandas-resolution] dtype attribute outside of xarray, which as you suggest seems unlikely since this is somewhat of an xarray-specific encoding approach.

To enable round tripping in those situations, we would need to decide whether to use the dtype provided in the encoding dictionary when saving a variable to disk, or whether to use the dtype of the array. If we used the dtype from the encoding dictionary (if it existed) without extra guardrails, it would allow users to specify dtype encodings that might be inconsistent with their data, e.g. timedelta64[s] for an array of dtype timedelta64[us]. If they wanted to do that, I would prefer that they do so by casting the data itself, e.g. astype("timedelta64[s]"), prior to encoding.

Currently this PR takes the simple approach of always using the dtype of the array and inferring the units from that. We might consider warning if the user for whatever reason specified dtype and units encoding attributes that conflicted with those—right now we just silently ignore them—but in general, my hope is that users would not ever feel the need to try to modify encoding parameters manually with this approach and rather use astype instead.
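As a rough illustration of that approach, the sketch below derives the "dtype" and "units" attributes from the array alone and overrides any conflicting values found in a user-supplied encoding dictionary. The helper `resolve_encoding` and the `CF_UNITS` table are hypothetical names for this example, not xarray API.

```python
import numpy as np

# Hypothetical mapping from NumPy unit codes to unit names.
CF_UNITS = {"s": "seconds", "ms": "milliseconds", "us": "microseconds", "ns": "nanoseconds"}


def resolve_encoding(data, encoding):
    """Return encoding attributes derived from the array dtype only."""
    unit, _ = np.datetime_data(data.dtype)
    # Conflicting "dtype" or "units" entries in `encoding` are overridden.
    return {**encoding, "dtype": str(data.dtype), "units": CF_UNITS[unit]}


data = np.array([1_500_000, 3_000_000], dtype="timedelta64[us]")
stale = {"dtype": "timedelta64[s]", "units": "seconds"}
resolved = resolve_encoding(data, stale)

# To store seconds instead, cast the data itself before encoding.
# Note that the cast discards the sub-second part of 1.5 s.
coarse = data.astype("timedelta64[s]")
resolved_coarse = resolve_encoding(coarse, {})
```

Casting with astype makes any loss of precision explicit in the user's code rather than hiding it in an encoding parameter.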

Does that make sense? Maybe I'm just being overly reluctant.

spencerkclark · 2025-05-11

A compromise could be that we decode and warn that not only the resolution of the array will be different, but also that round tripping will not produce identical results. I pushed that in 0e67a04. Let me know what you think.
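One way to picture this compromise is a helper that picks a fallback resolution and warns, following the "s" or "ns" suggestion above. This is a sketch under assumptions, not the code pushed in 0e67a04: the function name `decoding_resolution`, the unit groupings, and the warning text are illustrative.

```python
import warnings

import numpy as np

SUPPORTED = ("s", "ms", "us", "ns")
COARSER_THAN_SECOND = ("W", "D", "h", "m")
FINER_THAN_NANOSECOND = ("ps", "fs", "as")


def decoding_resolution(dtype_attr):
    """Pick the resolution to decode with for a timedelta64 dtype attribute."""
    unit, _ = np.datetime_data(np.dtype(dtype_attr))
    if unit in SUPPORTED:
        return unit
    if unit in COARSER_THAN_SECOND:
        fallback = "s"
    elif unit in FINER_THAN_NANOSECOND:
        fallback = "ns"
    else:
        # Calendar units (Y, M) and generic timedelta64 are left unsupported here.
        raise ValueError(f"cannot decode resolution {unit!r}")
    warnings.warn(
        f"decoding {dtype_attr} with resolution {fallback!r}; "
        "round tripping will not reproduce the original dtype",
        UserWarning,
        stacklevel=2,
    )
    return fallback


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exact = decoding_resolution("timedelta64[ms]")
    from_days = decoding_resolution("timedelta64[D]")
    from_pico = decoding_resolution("timedelta64[ps]")
```

Decoding sub-nanosecond values at "ns" would lose precision, so a real implementation would need extra care on that branch.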

kmuehlbauer · 2025-05-12

Yes, great compromise! Thanks @spencerkclark, 💯!

kmuehlbauer

I think this is a very neat implementation of literal timedelta coding! I left one comment, where I expect possible user issues. But those cases should be rare, I guess.

Proof of concept literal timedelta64 coding

063437b

spencerkclark mentioned this pull request Mar 6, 2025

Timedelta64 data cannot be round-tripped to netCDF files without a warning #10099

Open

Ensure test_roundtrip_timedelta_data test uses old encoding pathway

03f2988

spencerkclark added 2 commits March 6, 2025 19:45

Remove no longer relevant test

bdb53d7

Merge branch 'main' into timedelta64-encoding

05c3ce6

shoyer reviewed Mar 7, 2025

spencerkclark and others added 11 commits March 8, 2025 09:47

Include units attribute

00d9eaa

Move coder to times.py

b043b45

Merge branch 'main' into timedelta64-encoding

6f4e6e4

Add what's new entry

7f73753

Merge branch 'timedelta64-encoding' of https://github.com/spencerkcla…

4a8e111

…rk/xarray into timedelta64-encoding

Restore test and reduce diff

9ce2a24

Fix typing

eb6e19a

[pre-commit.ci] auto fixes from pre-commit.com hooks

436e588

for more information, see https://pre-commit.ci

Fix doctests

a305238

Restore original order of encoders

b406c64

Add return types to tests

a21b137

spencerkclark added 3 commits March 8, 2025 18:40

Move everything to CFTimedeltaCoder; reuse code where possible

5108b02

Fix mypy

452968c

Use Kai's offset and scale_factor logic for all encoding

503db4a

spencerkclark added 5 commits March 21, 2025 20:46

Merge branch 'main' into timedelta64-encoding

9aee097

Fix bad merge

56f55e2

Forbid mixing other encoding with literal timedelta64 encoding

c5e7de9

Expose fine-grained control over decoding pathways

d1744af

Rename test

7c7b071

spencerkclark added 7 commits March 22, 2025 12:47

Use consistent dtype spelling

da1edc4

Continue supporting non-timedelta dtype-only encoding

2bb4b99

Fix example attribute in docstring

0220ed5

Update what's new

c83fcb3

Fix typo

d1e8a5e

Complete test

7b94d35

Fix docstring

f269e68

dcherian requested a review from shoyer March 29, 2025 17:04

spencerkclark added 2 commits April 6, 2025 10:36

Support _FillValue or missing_value encoding

46169ab

Merge branch 'main' into timedelta64-encoding

3ad0825

Merge branch 'main' into timedelta64-encoding

a697ce4

spencerkclark marked this pull request as ready for review April 22, 2025 18:54

github-actions bot added topic-documentation topic-CF conventions topic-cftime labels Apr 22, 2025

Merge branch 'main' into timedelta64-encoding

ea9b443

dcherian requested a review from kmuehlbauer May 6, 2025 20:47

kmuehlbauer reviewed May 7, 2025

kmuehlbauer approved these changes May 7, 2025

spencerkclark added 4 commits May 11, 2025 12:00

Tweak errors and warnings; relax decoding dtype error

0e67a04

Add xfail test for fine-resolution branch of non-pandas resolution code

8df1981

Merge branch 'main' into timedelta64-encoding

ed14179

Fix typing

0929ec4

github-actions bot added the topic-typing label May 11, 2025

spencerkclark added 2 commits May 11, 2025 13:40

Revert "Fix typing"

191667f

This reverts commit 0929ec4.

Use simpler typing fix for now

52a7255
